Abstract - This paper describes the results of a significant research
and development effort conducted at NASA Ames Research Center to
develop new text mining techniques to discover anomalies in
free-text reports regarding system health
and safety of two aerospace systems. We discuss two prob- 
lems of significant import in the aviation industry. The first 
problem is that of automatic anomaly discovery about an 
aerospace system through the analysis of tens of thousands 
of free-text problem reports that are written about the sys- 
tem. The second problem that we address is that of automatic 
discovery of recurring anomalies, i.e., anomalies that may be 
described in different ways by different authors, at varying 
times and under varying conditions, but that are truly about 
the same part of the system. The intent of recurring anom- 
aly identification is to determine project or system weak- 
ness or high-risk issues. The discovery of recurring anom- 
alies is a key goal in building safe, reliable, and cost-effective 
aerospace systems. 

We address the anomaly discovery problem on thousands of 
free-text reports using two strategies: (1) as an unsupervised 
learning problem where an algorithm takes free-text reports 
as input and automatically groups them into different bins, 
where each bin corresponds to a different unknown anomaly 
category; and (2) as a supervised learning problem where the 
algorithm classifies the free-text reports into one of a number 
of known anomaly categories. We then discuss the applica- 
tion of these methods to the problem of discovering recurring 
anomalies. In fact, the special nature of recurring anomalies 
(very small cluster sizes) requires incorporating new methods 
and measures to enhance the original approach for anomaly 
detection. 

We present our results on the identification of recurring
anomalies in problem reports concerning two aerospace systems.
The first system is the Aviation Safety Reporting System (ASRS)
database, which contains several hundred thousand free-text
reports filed by commercial pilots concerning safety issues on
commercial airlines. The second aerospace system we analyze is
the NASA Space Shuttle problem reports as represented in the
CARS dataset, which consists of 7440 NASA Shuttle problem
reports. We show significant classification accuracies on both
of these systems and compare our results with reports
classified into anomalies by field experts.

Keywords - Target detection, adaptive tests, sequential detection.

Table of Contents 

1 Introduction 

2 Brief Look at Classification Methods 

3 Vector Space Model 

4 Directional Statistics 

5 The VMF algorithm

6 Robustness of the algorithm 

7 Datasets Used 

8 Simulation Results 

9 Text Classification of Flight Reports to 
Occurring Anomalies 

10 Recurring Anomaly Detection 

11 Conclusions and future work 

1. Introduction

Aerospace systems have a voluminous amount of information 
in the form of structured and unstructured text documents, 
much of it specifically relating to reports of anomalous be- 
havior of craft, craft subsystem(s), and/or crew. Mining this 
document database can result in the discovery of valuable in- 
formation regarding system health monitoring. 

In this direction, content based clustering of these reports 
helps detect recurring anomalies and relations in problem re- 
ports that indicate larger systemic problems. Clustering and 
classification methods and results will be presented using the 
Aviation Safety Reporting System (ASRS) database. The 
clustering results for two standard publicly available datasets
will also be shown to allow method comparison to be performed
by others.

Clustering and classification techniques can be applied to 
group large amounts of data into known categories. The sec- 
ond problem addressed in this paper is to then autonomously 
identify recurring anomalies. This approach will be presented 
and results shown for the CARS dataset. This work has extended
uses, including post-analysis for military, factory, automobile
and aerospace industries.

2. Brief Look at Classification Methods 

A wide variety of methods in the field of machine learning
have been used to classify text documents. [Joachims] claims
that most text categorization problems are linearly separable,
making them ideal candidates for Support Vector Machines
(SVMs). In [], he makes an attempt to bring out the statistical
similarity between the parametric and non-parametric
approaches for classification.

Non-Parametric Methods

The non-parametric methods in the classification of text
documents are generally algorithms like k-means and Nearest
Neighbor classification. Consider a set of data points
distributed in a d-dimensional space. K-means chooses a set of
initial points as the seeds. In step 1, each document in the
dataset is associated with the seed document to which it has
the minimum Euclidean distance. This results in the
classification of documents into k clusters. In step 2, the seed
associated with each cluster is updated to the mean of all
document vectors in that particular cluster. With the updated
seeds, step 1 is repeated and the process continues
iteratively. The documents get assigned to different clusters
and the seeds keep getting updated. The algorithm converges
when either the seeds stop getting updated or the documents
are no longer assigned to different clusters during each
iteration. In the following sections we will bring out how this
heuristic algorithm is related to the Gaussian mixture model.
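
The two alternating steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work: the function name `kmeans`, the toy points, and the choice of the first and third points as seeds are all assumptions made for the example.

```python
import math

def kmeans(points, seeds, max_iter=100):
    """Cluster `points` (lists of floats) starting from the given seed vectors."""
    centers = [list(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        # Step 1: assign each point to the seed at minimum Euclidean distance.
        new_labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        # Step 2: move each seed to the mean of the points assigned to it.
        for k in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == k]
            if members:  # keep the old seed if a cluster becomes empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Toy data: two well separated groups in two dimensions.
points = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
labels, centers = kmeans(points, seeds=[points[0], points[2]])
```

On this toy input the loop stops after the second pass, once the assignments no longer change.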

Parametric Methods 

These can loosely be classified as a group of methods that 
involve parameter estimation. Any mixture model, in partic- 
ular, a mixture of distributions from the exponential family, 
can be considered a good example. The underlying random 
variable could be generated from any one of the distributions 
in the mixture model, with a probability equal to the prior 
probability associated with that particular distribution. 

Gaussian Mixture Models 

The Gaussian mixture model assumes that the text documents
were generated using a mixture of k Gaussian distributions,
each with its own parameters $\theta_i$:

$$ \sum_{i=1}^{k} \alpha_i f(x \mid \theta_i) \tag{1} $$

such that $\sum_{i=1}^{k} \alpha_i = 1$, where $\alpha_i$ is the
prior probability of the ith distribution. Each density is
representative of a particular category of documents. If there
are k categories in a document database, then this situation
can typically be modeled using a mixture model of k
distributions.

Expectation Maximization Algorithm and its application to 
Text Classification 

The expectation maximization algorithm is an iterative
approach to calculate the parameters of the mixture model
mentioned above. It consists of two steps: the Expectation step
(E-step) and the Maximization step (M-step). In the E-step, the
likelihood that the documents were generated using each
distribution in the mixture model is estimated. The documents
are assigned to the cluster whose representative probability
density function has the highest likelihood of generating the
document. This results in the classification of documents into
one of the k classes, each represented by a particular
probability density function. In the M-step, the maximum
likelihood estimates of the parameters of each distribution are
calculated. This step uses the classification results of the
E-step, where each class is assigned a set of documents.

We will attempt to explain the E-step and M-step in the context
of the Gaussian mixture model. Let us assume that we have M
data points that we want to fit using a mixture of K univariate
Gaussian distributions with identical and known variance.
The unknowns here are the parameters of the K Gaussian
distributions. Also, the information on which data point was
generated using which of the distributions in the mixture is
unknown. Each data point $Y_m$ is associated with K hidden
variables $\{w_{m,1}, w_{m,2}, w_{m,3}, \ldots, w_{m,K}\}$ where
$w_{m,k} = 1$ if $Y_m$ was generated using distribution k,
otherwise $w_{m,k} = 0$. The ML estimate of the mean $\mu_k$ of
the kth distribution is given by:

$$ \mu_k = \frac{1}{M_k} \sum_{m=1}^{M} w_{m,k} Y_m \tag{2} $$

where $M_k = \sum_{m=1}^{M} w_{m,k}$.

The problem is that we know neither the value of $\mu_k$ nor the
hidden variables $w_{m,k}$.

E step: The expected values of the $w_{m,k}$ are calculated, based
on assumed values or current estimates of the Gaussian
parameters $\mu_k$.
This corresponds to clustering data points by minimizing the
Euclidean distances in the k-means algorithm.

M step: Using the expected values of $w_{m,k}$, the ML estimates
of $\mu_k$ are calculated. This corresponds to updating the seeds
of the cluster centers at every iteration of the k-means
algorithm: the center of each cluster becomes the mean of all
the documents or data points in that cluster.
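
As a rough sketch of the two steps for the univariate case, the code below alternates between computing the expected hidden indicators and re-estimating the means. It assumes equal prior probabilities for all components and a known shared standard deviation `sigma`; the function name `em_means` and the toy data are hypothetical and only serve the illustration.

```python
import math

def em_means(data, means, sigma=1.0, n_iter=50):
    """Estimate the K means of equal-weight Gaussians with known sigma."""
    means = list(means)
    for _ in range(n_iter):
        # E-step: expected value of each hidden indicator w[m][k],
        # proportional to the Gaussian density of point m under component k.
        w = []
        for y in data:
            dens = [math.exp(-((y - mu) ** 2) / (2 * sigma ** 2)) for mu in means]
            total = sum(dens)
            w.append([d / total for d in dens])
        # M-step: each mean becomes the w-weighted average of the data.
        for k in range(len(means)):
            m_k = sum(row[k] for row in w)
            means[k] = sum(row[k] * y for row, y in zip(w, data)) / m_k
    return means

data = [-0.2, 0.0, 0.2, 9.8, 10.0, 10.2]
means = em_means(data, means=[1.0, 8.0])
```

Replacing the soft weights in the E-step with a one-hot vector for the nearest mean turns this loop into the k-means iteration.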

Thus the k-means algorithm is a special implementation of 
the Gaussian Mixture Model, which models the distribution 
of the underlying data points as a mixture of Gaussian distri- 
butions. The parameters are determined by the iterative Ex- 
pectation Maximization (EM algorithm) of the log likelihood 
function. The algorithm, however, does not work on sparsely 
located data points in a high dimensional space. 

3. Vector Space model 

The vector space model is a classical way of representing text
documents. This representation helps apply machine learning
techniques to document classification. A database of
text documents can be represented in the form of a Bag Of
Words (BOW) matrix. Each row of the BOW matrix represents
a document and the columns are given by the union
of all words in all the documents. Each word is associated
with a Term Frequency (TF), which is given by the total
number of times the word occurs in the document. Document
Frequency (DF) is defined as the total number of documents in
which the word $w_i$ occurs. The (i, j)th cell of the BOW
matrix corresponds to the TFIDF, which is the Term Frequency
Inverse Document Frequency of the jth word in the ith
document. The TFIDF is defined as TFIDF = TF · IDF, where
$IDF(w_i) = \log(n / DF(w_i))$.

Here n is the total number of documents in the document
database. Thus each text document is represented as a point in
a high dimensional vector space. The BOW matrix is of huge
dimension, and a variety of techniques like Principal
Component Analysis (PCA), Singular Value Decomposition (SVD)
and Information Theoretic approaches have been used to
reduce the dimensionality of the vector space.
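
To make the construction of the BOW matrix concrete, here is a small sketch that builds TFIDF rows from raw strings. Whitespace tokenization, lowercasing, and the function name `tfidf_matrix` are assumptions for the example; the preprocessing actually used for the datasets in this paper is described later and is more involved.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocabulary, rows); rows[i][j] is the TFIDF of word j in document i."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted(set(w for toks in tokenized for w in toks))
    n = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(w for toks in tokenized for w in set(toks))
    rows = []
    for toks in tokenized:
        tf = Counter(toks)  # term frequency within this document
        rows.append([tf[w] * math.log(n / df[w]) for w in vocab])
    return vocab, rows

docs = ["engine fault engine", "engine check", "hydraulic leak"]
vocab, rows = tfidf_matrix(docs)
```

A word that occurs in every document gets an IDF of log(1) = 0, so it contributes nothing to any row.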

4. Directional Statistics 

Directional statistics is a field of statistics dealing with the
statistical properties of directional random variables. For
example, the random variable representing the position of a
roulette wheel can be said to exhibit directional statistics.

Why Use Directional Distribution for Text Data 

The preprocessing step before applying the algorithms to text
data involves normalization. The TFIDF document vectors
are L2 normalized to make them unit norm. Here the
assumption is that the direction of documents is sufficient to
get good classification, and hence by normalization the effect
of the length of the documents is nullified. For example, two
documents, one short and one lengthy, on the same topic will
have the same direction and hence be put in the same cluster.
If the vector space before normalization is $\mathbb{R}^d$,
the unit normalized data lives on the (d-1)-dimensional unit
sphere in that space. Since it is spherical data, it is more
appropriate to use directional distributions.
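
The effect of L2 normalization on document length can be checked with a short sketch. The three vectors below are made-up term weights, and the helper names `l2_normalize` and `cosine` are assumptions for the illustration.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def cosine(u, v):
    """Dot product of the unit-normalized vectors, i.e. the cosine of their angle."""
    return sum(a * b for a, b in zip(l2_normalize(u), l2_normalize(v)))

short_doc = [1.0, 2.0, 0.0]
long_doc = [10.0, 20.0, 0.0]   # same topic mix, ten times the length
other_doc = [0.0, 0.0, 3.0]    # shares no terms with the first two

same_topic = cosine(short_doc, long_doc)
different_topic = cosine(short_doc, other_doc)
```

After normalization the short and the long document coincide on the unit sphere, while the unrelated document is orthogonal to both.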

The von Mises Fisher Distribution 

Von Mises Fisher distribution is one of the directional distrib- 
utions. It was developed by Von Mises to study the deviations 
of measured atomic weights from integer values. Its impor- 
tance in statistical inference on a circle is almost the same as 
that of the normal distribution on a line. 

VMF distribution for a two dimensional circular random
variable: A circular random variable $\theta$ is said to follow a von
Mises distribution if its p.d.f. is given by:

$$ g(\theta; \mu_0, \kappa) = \frac{1}{2\pi I_0(\kappa)} \exp\left(\kappa \cos(\theta - \mu_0)\right), \quad 0 \le \theta < 2\pi,\ \kappa > 0,\ 0 \le \mu_0 < 2\pi \tag{4} $$

where $I_0(\kappa)$ is the modified Bessel function of the first kind
and order zero. The parameter $\mu_0$ is the mean direction while
the parameter $\kappa$ is described as the concentration parameter.
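
As a sanity check on the circular density in (4), the sketch below evaluates it with a power-series approximation of the order-zero Bessel function and verifies numerically that it integrates to one. The function names, the series length, and the parameter values are assumptions chosen for the example.

```python
import math

def bessel_i0(x, terms=100):
    """Modified Bessel function of the first kind, order zero, via its power series."""
    term, total = 1.0, 0.0
    for k in range(terms):
        total += term
        term *= (x / 2) ** 2 / (k + 1) ** 2  # ratio of consecutive series terms
    return total

def von_mises_pdf(theta, mu0, kappa):
    """Density of the von Mises distribution on the circle."""
    return math.exp(kappa * math.cos(theta - mu0)) / (2 * math.pi * bessel_i0(kappa))

# Riemann-sum check that the density integrates to one over [0, 2*pi).
n = 10000
step = 2 * math.pi / n
total = sum(von_mises_pdf(i * step, mu0=1.0, kappa=2.0) for i in range(n)) * step
```

Larger values of `kappa` concentrate more of the mass around `mu0`, and as `kappa` approaches 0 the density approaches the uniform distribution on the circle.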
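A unit random vector x is said to have a d-variate von
Mises-Fisher distribution if its pdf is:

$$ f(x \mid \mu, \kappa) = c_d(\kappa) \exp\left(\kappa \mu^{T} x\right) \tag{5} $$

where $\|\mu\| = 1$ and $\kappa > 0$. The closed form expression for
the normalizing constant $c_d(\kappa)$ is given by:

$$ c_d(\kappa) = \frac{\kappa^{d/2 - 1}}{(2\pi)^{d/2} I_{d/2 - 1}(\kappa)} \tag{6} $$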
The Choice of VMF among all other spherical distributions

This section analyzes the appropriateness of using the von
Mises distribution for text classification among all other
spherical distributions. Is there a Central Limit Theorem (CLT)
for directional data? Does it correspond to the CLT for
non-directional data? For data on a line, the CLT says that the
Normal distribution is the limiting distribution. For
directional data, by contrast, the limiting distribution of the
sum of n independent random variables is the Uniform
distribution. In spite of this, the Uniform distribution is
hardly a contender for modeling directional data [4].

Relation to bivariate normal distribution: The VMF shows
several analogies to the properties of the normal distribution.
Due to space limitations we will briefly discuss a few such
analogies.

Maximum Likelihood Characterization: Consider
the distribution of a random variable on the real line. Let
$f(x - \mu)$ represent the distribution, where $\mu$ is the mean. The
maximum likelihood estimate for $\mu$ is given by the sample
mean if and only if the distribution is Gaussian. Similarly, for
a random variable $\theta$ on a circle, let the directional distribution
be given by $g(\theta - \mu_0)$. The maximum likelihood estimate
for the mean $\mu_0$ is given by the sample mean direction
$\bar{x}_0$ if and only if the directional distribution is the VMF
distribution.

Maximum Entropy Characterization: Given a fixed mean and
variance for a random variable x, the Gaussian is the
distribution that maximizes the entropy. Likewise, given a fixed
circular variance and mean direction, the VMF distribution
maximizes the entropy.

Unfortunately there is no distribution for directional data 
which has all properties analogous to the linear normal distri- 
bution. The VMF has some but not all of the desirable prop- 
erties. The wrapped normal distribution is a strong contender 
to VMF. But the VMF provides simpler ML estimates. Also 
the VMF is more tractable while doing hypothesis testing. 
Hence the use of VMF over other directional distributions is 
justified. 

5. The VMF algorithm

In this section we will discuss the theory behind modeling
the text documents as a mixture model of VMF distributions.
Consider a mixture model consisting of K VMF distributions
similar to (1). Each distribution is attributed a prior
probability of $\alpha_k$, with $\sum_{k=1}^{K} \alpha_k = 1$ and
$\alpha_k > 0$. It is given by:

$$ f(x \mid \Theta) = \sum_{k=1}^{K} \alpha_k f_k(x \mid \theta_k) \tag{7} $$

Here $\Theta = \{\alpha_1, \alpha_2, \ldots, \alpha_K, \theta_1, \theta_2, \ldots, \theta_K\}$ and $\theta_k = (\mu_k, \kappa_k)$.

Let $Z = \{z_1, \ldots, z_n\}$ be the hidden variables associated with
the document vectors $X = \{x_1, x_2, \ldots, x_n\}$, where $z_i = k$ if the
document vector $x_i$ was generated from the kth VMF
distribution. Assuming that the distribution of the hidden variables
$p(k \mid x_i, \Theta) = p(z_i = k \mid x = x_i, \Theta)$ is known, the complete
log likelihood of the data, with expectation taken
over the distribution p, is given by:
The Maximization Step: In the parameter estimation step or
maximization step, we estimate $\Theta$ by maximizing (8). By
taking partial derivatives of (8) with respect to the
parameters, the ML estimates are given by:
The ML update for $\kappa$, obtained after approximations, is given
by:

where $r_k$ =
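
The update equations themselves did not survive in this copy of the text. As a hedged sketch only, the code below assumes the updates commonly used for mixtures of von Mises-Fisher distributions: the prior is the average posterior weight, the mean direction is the normalized posterior-weighted sum of the document vectors, and the concentration uses the approximation $\kappa \approx (\bar{r} d - \bar{r}^3) / (1 - \bar{r}^2)$, where $\bar{r}$ is the length of the weighted sum divided by the total weight. These forms, the function name `vmf_m_step`, and the toy data are assumptions and may differ from the equations the authors derived.

```python
import math

def vmf_m_step(X, post):
    """X: list of unit vectors; post[i][k]: p(k | x_i). Returns (alpha, mu, kappa)."""
    n, d, K = len(X), len(X[0]), len(post[0])
    alpha, mu, kappa = [], [], []
    for k in range(K):
        weight = sum(post[i][k] for i in range(n))
        # Posterior-weighted resultant vector for component k.
        r = [sum(post[i][k] * X[i][j] for i in range(n)) for j in range(d)]
        r_norm = math.sqrt(sum(v * v for v in r))
        r_bar = r_norm / weight
        alpha.append(weight / n)
        mu.append([v / r_norm for v in r])
        # Assumed approximate concentration update; undefined when r_bar == 1,
        # i.e. when all vectors of a component are identical.
        kappa.append((r_bar * d - r_bar ** 3) / (1 - r_bar ** 2))
    return alpha, mu, kappa

# Four unit vectors in two dimensions with hard (0/1) posteriors.
X = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-0.6, 0.8]]
post = [[1, 0], [1, 0], [0, 1], [0, 1]]
alpha, mu, kappa = vmf_m_step(X, post)
```

Tightly grouped vectors give a resultant length close to 1 and therefore a large concentration estimate.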

The Expectation Step: Assuming that the ML updates
calculated in the above step are correct, the expectation step
updates the distribution of the hidden variables Z. There are
two ways of assigning the documents to clusters: soft and
hard assignments. The distribution of the hidden variables
under the soft assignment scheme is:

$$ p(k \mid x_i, \Theta) = \frac{\alpha_k f_k(x_i \mid \Theta)}{\sum_{l=1}^{K} \alpha_l f_l(x_i \mid \Theta)} \tag{12} $$

Under the hard assignment scheme, the update equations are
given by:

$$ q(k \mid x_i, \Theta) = \begin{cases} 1, & \text{if } k = \arg\max_{k'} q(k' \mid x_i, \Theta) \\ 0, & \text{otherwise} \end{cases} \tag{13} $$


So according to (13), the documents either belong to a cluster
or they do not. There is no notion of a document belonging
to several clusters, and no one-to-many mapping between
the document and cluster domains. In practice this may be
disadvantageous because some data sets, like the Reuters data
set, have multi-labeled documents. Some of the most popular
classes in the Reuters dataset are ACQ, CORN, WHEAT and
EARN. In this case, there are documents that belong to ACQ,
EARN and WHEAT. It would be impossible to get this kind
of categorization using the hard assignment scheme.
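
The difference between the two schemes can be seen on a single document. In the sketch below the prior probabilities and component likelihoods are made-up numbers, and the helper names `soft_assign` and `hard_assign` are assumptions for the illustration.

```python
def soft_assign(priors, likelihoods):
    """Posterior p(k | x), proportional to prior times component likelihood."""
    joint = [a * f for a, f in zip(priors, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]

def hard_assign(posterior):
    """One-hot vector that keeps only the most probable cluster."""
    best = max(range(len(posterior)), key=posterior.__getitem__)
    return [1 if k == best else 0 for k in range(len(posterior))]

priors = [0.5, 0.3, 0.2]
likelihoods = [0.2, 0.3, 0.01]   # hypothetical f_k(x | theta_k) values
soft = soft_assign(priors, likelihoods)
hard = hard_assign(soft)
```

Here the soft posterior gives the second cluster a weight of about 0.47, yet the hard scheme discards it entirely, which is the loss of multi-label information described above.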



6. Robustness of the algorithm


Although the update equations for the VMF algorithm
derived in the previous section have closed form expressions,
the calculations become intractable as the dimensionality of
the vector space grows, because of the huge numbers
involved. This caused numerical issues when the algorithms
were implemented. To overcome this problem,
mathematical approximations were plugged into the update
equations. For a modified Bessel function of the first kind and
order n, for large x, fixed n and $x \gg n$, the approximation
is given as follows:

$$ I_n(x) \approx \frac{e^{x}}{\sqrt{2\pi x}} \tag{14} $$
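
The quality of this large-argument approximation can be checked against a direct series evaluation. The sketch below is illustrative only: the helper names and the test point x = 50 are assumptions, and the series is accumulated term by term to avoid overflow in the intermediate powers and factorials.

```python
import math

def bessel_i(n, x, terms=200):
    """Series for the modified Bessel function of the first kind, integer order n."""
    term = (x / 2) ** n / math.factorial(n)   # k = 0 term of the series
    total = 0.0
    for k in range(terms):
        total += term
        term *= (x / 2) ** 2 / ((k + 1) * (k + 1 + n))
    return total

def bessel_i_large_x(x):
    """Leading-order large-argument approximation; it does not depend on the order."""
    return math.exp(x) / math.sqrt(2 * math.pi * x)

ratio_order0 = bessel_i_large_x(50.0) / bessel_i(0, 50.0)
ratio_order2 = bessel_i_large_x(50.0) / bessel_i(2, 50.0)
```

At x = 50 the approximation is within about 1% for order 0 and within a few percent for order 2, which illustrates why the condition that x be much larger than n matters.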


7. Datasets Used 

We have experimented with several data sets standardly used 
for text classification. 

The 20 News Groups data set: It is a collection of 19997 
documents belonging to 20 different news groups. Since the 
documents in this dataset are primarily email messages, head- 
ers such as from, to, subject, organization etc were removed 
in the preprocessing step. We had an extensive stop word 
list, which was also removed from the documents. We tried 
to eliminate as many special characters as possible in order 
not to skew the results of the clustering algorithm. Removing 
these helps in dimensionality reduction. We were interested 
only in the body of the messages to keep it a free text classi- 
fication exercise. 

The Diff3 and Sim3 datasets were created from the 20 News
Groups dataset to verify the performance of the algorithm on
well separated classes of documents and on document classes
that are closely related to each other in terms of content. The
size of the dataset also has a bearing on the classification
accuracy: the more samples available to learn the
distribution, the better the classification results. The
sim3-small and diff3-small datasets were therefore created with
only 100 documents from each class.

The CARS Data set: The CARS dataset is a collection of
problem reports generated by engineers in different fields for the
problems in the shuttle. It contains .... documents with a total
of .... words in it.

The Reuters dataset: It is the most widely used dataset in text
categorization research. It is a collection of 21578 documents,
each belonging to multiple classes.

The Yahoo News Groups Dataset: This dataset consists of a
collection of 2340 documents belonging to 20 different
categories.

8. Simulation Results


Mutual Information: Mutual Information is used as the
criterion for comparing the performance of the different methods
on the various data sets. Consider two random variables x and y.
Mutual Information is generally used in statistics to measure
the degree of information that can be obtained about one random
variable by knowing the value of another random variable.
Let p(x) and p(y) be the marginal distributions of x and y
and let the joint distribution be p(x, y). The Mutual
Information between x and y is defined as:

$$ I(x; y) = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \tag{15} $$

We used the Mutual Information between the vector of class
labels produced by the algorithms and the actual class
labels of the documents as the criterion to compare the
performance of the different algorithms.
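
A minimal sketch of this criterion, estimating the distributions from label counts, is shown below. The function name `mutual_information`, the use of natural logarithms, and the toy label vectors are assumptions for the example.

```python
import math
from collections import Counter

def mutual_information(labels_a, labels_b):
    """Empirical mutual information (in nats) between two label sequences."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))
    mi = 0.0
    for (a, b), c in count_ab.items():
        # p(a,b) * log( p(a,b) / (p(a) p(b)) ), written in terms of counts
        mi += (c / n) * math.log(c * n / (count_a[a] * count_b[b]))
    return mi

truth = [0, 0, 1, 1]
perfect = [1, 1, 0, 0]      # same partition, cluster ids permuted
useless = [0, 1, 0, 1]      # independent of the true classes

mi_perfect = mutual_information(truth, perfect)
mi_useless = mutual_information(truth, useless)
```

The measure does not depend on how cluster ids are numbered, which makes it convenient for comparing clustering output against class labels.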

To be included: Performance Curves: Comparison of VMF 
Vs Kmeans: Mutual Information Vs the Number of clusters 
(averaged over 20 iterations) 

• 20 News Groups Diff3 Dataset

• 20 News Groups Sim3 Dataset

• Small Sim3

• Small Diff3

• Yahoo News Groups

• Reuters Dataset

Confusion Matrices to be included Classification confusion 
matrices for some / all of the above datasets. 

Also examples of the top frequency words in each cluster and 
how they can be representative keywords for the clusters can 
be included. 

9. Text Classification of Flight Reports 
to Occurring Anomalies 

Problem Definition 

After each commercial flight in the US, a report is written
on that flight describing how the flight went and whether any
anomalous events happened. There are a number of
predefined anomalies which can occur in the aircraft during a
flight. The goal of text classification is to develop a system
that, based on the semantic meaning of a report, infers which
anomalies, if any, occurred during the flight for which the
report was written.

The work at the semantic level has already been done and
we are given the reports as a "bag of words/terms", which
contains for all reports their terms, extracted by Natural
Language Processing methods, and the corresponding
frequencies of the terms. There are a total of 20,696 reports, a
total of 28,138 distinct terms, and a total of 62 different
anomalies. The anomalies are named with their codes ranging from
413 to 474. A report can have between 0 and 12 anomalies.



Whether a particular anomaly has occurred or not is labeled
by 1 and 0 respectively in the training data set. Most reports
(over 90 % of them) contain more than 1 anomaly, with the 
most common group of reports containing exactly 2 anom- 
alies (5,048 reports). The most frequent anomaly occurs in 
almost half of the reports. 

System Overview 

By running association rules on the anomaly labels, we found 
out that there is not any strong correlation among different 
anomalies. We concluded that each anomaly has to be treated 
individually. We, thus, treat the multi-label classification 
problem as a binary classification problem for every anom- 
aly. As an initial step we pick to work with 12 of the 62 
anomalies and try to find a classifier that will perform best 
for each of them. Our approach can be summarized in three 
main phases. In the first phase we load the data into a data- 
base, collect statistics on it for the purposes of studying the 
data, then remove the terms with very low frequency. In the 
second phase we run common feature selection algorithms to 
reduce the feature space by picking the best terms for every 
anomaly. In the final phase we experiment with several
algorithms commonly used for text classification, such as Support
Vector Machines, Naive Bayes, AdaBoost, Linear Discriminant
Analysis (LDA), and Logistic Regression, implemented in
the open-source packages WEKA [1], SVM-light [2] and R.
We show that SVM, with an RBF kernel in
particular, performs best for this particular text classification
problem. Figure 4.1 summarizes our architecture.



Figure 1. System Architecture 
Removal of low frequency terms

We remove all terms, regardless of their frequencies, which
appear in exactly one report. The intuition behind this is that
those terms are not frequent enough to be used for training
and will most likely never be seen in the test data. Also, since
even the low-frequency anomalies occur in at least hundreds of
reports, we do not expect much contribution of the rare terms
to the classification problem. After the removal of those rare
terms, the total number of terms left is 17,142.
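
The filtering rule can be sketched as follows, assuming each report is held as a dictionary from term to frequency. That data layout, the function name `drop_single_report_terms`, and the toy reports are assumptions; the database representation used in this work is not shown here.

```python
from collections import Counter

def drop_single_report_terms(reports):
    """reports: list of {term: frequency} dicts. Drops terms found in exactly one report."""
    # Count in how many reports each term appears (not how often it occurs).
    doc_freq = Counter(term for report in reports for term in report)
    return [
        {t: f for t, f in report.items() if doc_freq[t] != 1}
        for report in reports
    ]

reports = [
    {"engine": 3, "smoke": 1, "xqz": 7},   # "xqz" is frequent here but unique to this report
    {"engine": 1, "altitude": 2},
    {"altitude": 1, "smoke": 1},
]
filtered = drop_single_report_terms(reports)
```

Note that the rule looks at the number of reports containing a term, so a term repeated many times inside a single report is still removed.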


Feature Selection 

In this phase we perform feature reduction by selecting the
most informative terms for every anomaly [5][6]. We use the
Information Gain criterion to rank the terms according to how
informative they are for a specific anomaly:

$$ IG(class, term) = H(class) - H(class \mid term) \tag{16} $$

where $H(class)$ denotes the entropy of a specific anomaly,
and $H(class \mid term)$ denotes the conditional entropy of an
anomaly given a particular term. For every anomaly we
experimentally find the optimal number of terms.
This is an iterative process and includes picking different
numbers of best terms for each anomaly and then running
several different classifiers and analyzing the performance
results. For some anomalies it is best to keep the top 1000
ranked terms out of 17,142, and for some others this number
is 500 or 1500. For efficiency purposes we set 1500 as
an upper threshold on the number of terms we would work
with. Working with just the best 500, 1000, or 1500 terms for
each anomaly helps speed up the classification process and
at the same time increases the classification accuracy. Figure
4.2 shows a comparison of the F-Measure (the harmonic mean
of precision and recall) results for the class of reports
having an anomaly, when different numbers of best terms are
picked for each anomaly. The classifier used for that
comparison is SVM with a linear kernel and default parameters.
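
The ranking criterion in (16) can be illustrated for a single term and a single anomaly. The sketch below treats the term as a binary feature (present or absent in a report), which is an assumption; the function names and the toy label vectors are likewise hypothetical.

```python
import math

def entropy(pos, total):
    """Binary entropy (bits) of a class with `pos` positives out of `total`."""
    if total == 0 or pos in (0, total):
        return 0.0
    p = pos / total
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def information_gain(has_term, has_anomaly):
    """IG(class, term) = H(class) - H(class | term) for boolean sequences."""
    n = len(has_anomaly)
    h_class = entropy(sum(has_anomaly), n)
    h_cond = 0.0
    for value in (True, False):
        # Reports where the term is present (True) or absent (False).
        subset = [a for t, a in zip(has_term, has_anomaly) if t == value]
        h_cond += len(subset) / n * entropy(sum(subset), len(subset))
    return h_class - h_cond

has_anomaly = [True, True, False, False]
informative_term = [True, True, False, False]   # appears exactly in anomalous reports
uninformative_term = [True, False, True, False] # unrelated to the anomaly

ig_high = information_gain(informative_term, has_anomaly)
ig_low = information_gain(uninformative_term, has_anomaly)
```

Sorting all terms by this score for a given anomaly and keeping the top 500, 1000, or 1500 gives the kind of per-anomaly term list described above.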



Figure 2. Figure 4.2. Number of terms, comparison 
An observation is that anomalies that do not occur so
frequently are classified more accurately with fewer
terms. This seems reasonable, since less frequently occurring
anomalies would be described well enough with just a few
terms.

Experimenting with different classifiers 

After we select the optimal number of terms for each anom- 
aly, we test different methods for classification. We experi- 
ment with Naive Bayes, Adaboost, SVM, LDA, Logistic Re- 
gression. At that point we want to find which method would 
give the best classification accuracy across all anomalies. The
histogram in Figure 4.3 shows the comparison on the Overall
Precision (both classes) for those methods:




Figure 3. Figure 4.3. Classifiers comparison (Overall Precision)
We use the implementation of SVM in both Weka and
SVM-light, and the Weka implementations of Naive Bayes and
AdaBoost with base learner Naive Bayes. SVM with a linear
kernel performs best on all anomalies. We therefore choose
to experiment further mainly with the SVM classifier,
although later we do make comparisons with two other common
classification methods, LDA and Logistic Regression.

Support Vector Machines for text classification 

Support Vector Machines are based on the structural risk
minimization principle from statistical learning theory [3]. In
their basic form SVMs learn linear decision rules
$h(x) = \mathrm{sign}\{w \cdot x + b\}$ described by a weight vector w and a
threshold b. Input is a sample of n training examples
$S_n = ((x_1, y_1), \ldots, (x_n, y_n))$, $x_i \in \mathbb{R}^N$, $y_i \in \{-1, +1\}$. For a
linearly separable $S_n$, the SVM finds the hyperplane with
maximum Euclidean distance $\delta$ to the closest training examples.
For non-separable training sets, the amount of training error
is measured using slack variables $\xi_i$. Computing the
hyperplane is equivalent to solving an optimization problem:

$$ \text{minimize: } V(w, b, \xi) = \frac{1}{2} w \cdot w + C \sum_{i=1}^{n} \xi_i \tag{17} $$

$$ \text{subject to: } \forall_{i=1}^{n}: y_i [w \cdot x_i + b] \ge 1 - \xi_i \tag{18} $$

$$ \text{and: } \forall_{i=1}^{n}: \xi_i \ge 0 \tag{19} $$

The constraints (18) require that all training examples are
classified correctly up to some slack $\xi_i$. If a training example
lies on the wrong side of the hyperplane, the corresponding $\xi_i$
is greater than or equal to 1. Therefore, $\sum_{i=1}^{n} \xi_i$ is an
upper bound on the number of training errors. The parameter C in
(17) allows trading off training error and model complexity.
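
To illustrate how the slack variables relate to training errors, the sketch below evaluates the objective and the smallest feasible slacks for a fixed, hand-picked hyperplane. It does not solve the optimization problem; the function name `svm_objective`, the weight vector, and the toy examples are assumptions.

```python
def svm_objective(w, b, C, X, y):
    """Return (V, slacks) for the soft-margin primal at a given (w, b)."""
    # Smallest slack satisfying y_i (w . x_i + b) >= 1 - xi_i with xi_i >= 0.
    slacks = [
        max(0.0, 1.0 - yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b))
        for xi, yi in zip(X, y)
    ]
    v = 0.5 * sum(wj * wj for wj in w) + C * sum(slacks)
    return v, slacks

X = [[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0], [1.0, 0.0]]
y = [1, 1, -1, -1]           # the last example sits on the wrong side
v, slacks = svm_objective(w=[1.0, 0.0], b=0.0, C=1.0, X=X, y=y)
```

In this toy case one example is misclassified and its slack exceeds 1, while the second example is classified correctly but lies inside the margin and receives a slack of 0.5, so the slack total of 2.5 bounds the single training error from above.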


SVMs work well in text classification [4] for a number of 
reasons: 

1. Text normally has a high dimensional input space. SVMs
use overfitting protection which does not depend on the
number of features and therefore have the potential to handle
large feature spaces.

2. Document vectors are sparse and SVMs are well suited for 
problems with sparse instances. 

3. Most text classification problems are linearly separable. 
SVMs easily find linear (and for that matter polynomial, RBF, 
etc) separators. 

SVMs can be implemented with different kernels, and for the
task of text classification the most popular are the linear,
polynomial and RBF kernels. We experiment with all those kernels
after normalizing the frequencies of the terms remaining after
the feature reduction. Let $f_{ij}$ be the frequency of term $t_i$
in document $d_j$. Then based on our normalization, the new
frequency $f'_{ij}$ of every term is:

= ( 20 ) 


withJ2(fl j ) = 1 (21) 


Our normalization differs from the unit length normalization, 
which we also tried but did not obtain desirable results. We 
experiment with the kernels that we mentioned above and re- 
sults of the anomalous class F-Measure are shown in Figure 
4.4. As one can observe, RBF kernel works best for almost all 
anomalies. In Figure 4.5 we show the recall-precision graph 
for one of the anomalies (code 413). It is evident from the 
graph that for a relatively low recall we can achieve very high 
precision. 


Figure 4. Figure 4.4. Kernels comparison (Anomalous F-Measure, normalized dataset)
Results of the break-even point (precision = recall) for all
anomalies are presented in Figure 4.6. From those results, we
can conclude that for some anomalies we get lower quality
predictions than for others. In other words, some anomalies
are much harder to classify than others. The problem with
the harder to classify anomalies can be related to the initial
"bag of words", where the terms picked for those anomalies
are apparently not descriptive enough.

Figure 5. Figure 4.5. Recall-Precision graph for anomaly
413



Figure 6. Figure 4.6. Break-even point for the anomalous 
class of 12 anomalies, SVM 


The SVM training and classification are very fast in the
SVM-light package. Training and 2-fold cross validation on 20,696
reports takes about 2 minutes on average on a 2 GHz Pentium
III Windows machine with 512MB of RAM.

SVM results comparisons with LDA and Logistic Regression 
results 

Our emphasis is on predicting accurately on the class that 
contains a specific anomaly. In other words, we want to be 
particularly accurate when we predict that an anomaly is 
present in a report. We call that the anomalous class. Since 
the frequency of anomalies across reports varies from about 
50% to less than 1%, we want to get both high precision and 
high recall on the anomalous class. That is why we deem the 
break-even point of the anomalous class to be the most 
meaningful metric for evaluating our results. In Figure 4.7 
we compare the break-even results of the SVM (RBF kernel) on 
the 12 anomalies shown above (Figure 4.6) with the break-even 
results obtained from the LDA and Logistic Regression 
classifiers commonly used by statisticians. 
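
As a minimal sketch of this metric, the break-even point of the 
anomalous class can be read off a ranking of the reports by 
classifier score: if exactly as many reports are flagged as there 
are truly anomalous ones, precision and recall coincide. The 
function name `break_even_point` and the toy scores are 
illustrative assumptions and are not part of the SVM-light 
workflow; ties between scores are ignored. 

```python
def break_even_point(scores, labels):
    """Precision-recall break-even point of the anomalous class.

    scores: classifier outputs, larger means "more anomalous".
    labels: 1 for reports in the anomalous class, 0 otherwise.
    Flagging the top n_pos reports (n_pos = number of anomalous
    reports) makes precision and recall both equal tp / n_pos.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("no anomalous examples")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = sum(label for _, label in ranked[:n_pos])
    return true_pos / n_pos


# Toy example (hypothetical scores, not data from the paper).
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.2]
toy_labels = [1, 0, 1, 1, 0]
bep = break_even_point(toy_scores, toy_labels)
```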



Figure 7. Figure 4.7. Break-even point for the anomalous 
class of 12 anomalies, comparison among SVM, LDA, and 
Logistic 


The results obtained with SVM with an RBF kernel are very 
good, with an average anomalous break-even point over all 
anomalies of 63% and a highest value of 78%. The average 
non-anomalous break-even point is at the 90%+ level. The 
break-even results using LDA and Logistic have weighted 
average anomalous break-even points of 57.26% and 49.78% 
respectively. Moreover, a break-even point could not be 
produced on 4 of the 12 anomalies using Logistic, and on 1 of 
the 12 anomalies using LDA. The robust SVM classifier easily 
produces break-even points for all anomalies. It outperforms 
LDA and Logistic on each of the 12 anomalies, by 5%-7% and 
10%-15% on average, respectively. 

10. Recurring Anomaly Detection 

The recurring anomaly detection problem that we address in 
this paper is as follows. Given a set of N documents, where 
each document is a free-text English document that describes 
a problem, an observation, a treatment, a study, or some other 
aspect of the vehicle, automatically identify a set of 
potential recurring anomalies in the reports. Note that for 
many applications, the corpus is too large for a single person 
to read, understand, and analyze by hand. Thus, while engineers 
and technicians can and do read and analyze all documents 
that are relevant to their specific subsystem, it is possible 
that other documents, which are not directly related to their 
subsystem, still discuss problems in the subsystem. While 
these issues could be addressed to some degree with the 
addition of structured data, it is unlikely that all such 
relationships would be captured in the structured data. 
Therefore, we need to develop methods to uncover recurring 
anomalies that may be buried in these large text stores. 
Overall, recurring anomaly detection helps to identify system 
weaknesses and avoid high-risk issues. The discovery of 
recurring anomalies is a key goal in building safe, reliable, 
and cost-effective aerospace systems. Furthermore, recurring 
anomaly detection can be applied to other domains, such as 
computer network security and health care management. 






From the research perspective, recurring anomaly detection 
is an unsupervised learning problem. The task of recurring 
anomaly detection has not been addressed by prior work 
because of the unique structure of the problem. The research 
most closely related to recurring anomaly detection is 
perhaps Novelty and Redundancy Detection in Adaptive 
Filtering [7]. Novelty and redundancy detection distinguishes 
between relevant documents that contain new (novel) 
information and relevant documents that do not. The definition 
of recurring anomaly in our problem matches the definition of 
redundancy. The difference between them lies in two aspects: 

1. Novelty and Redundancy Detection processes the documents 
in sequence, and recurring anomaly detection does not. 

2. Recurring Anomaly Detection groups recurring anomalies 
into clusters, and Novelty Detection does not. 

Another research field related to recurring anomaly detection 
is the retrospective event detection task in Topic Detection 
and Tracking [8] [9]. The retrospective detection task is 
defined to be the task of identifying all of the events in a 
corpus of stories. The recurring anomaly detection task 
differs from that task in having many single document 
clusters. However, the similarities between the tasks are 
worth exploring, and several methods we investigated are 
motivated by that work. The core part of our work is the 
similarity measures between statistical distributions. There 
has been much work on similarity measures. A complete study 
of distributional similarity measures is presented in [10]. 


Language Models and Similarity Measures 

There are two general approaches to measuring the similarity 
between documents: non-statistical methods and statistical 
methods. A typical non-statistical method is the cosine 
distance, which is a symmetric measure related to the angle 
between two vectors. It is essentially the inner product of 
the normalized document vectors. If we represent document $d$ 
as a vector $d = (w_1(d), w_2(d), \ldots, w_n(d))^T$, then: 

$$\cos(d_t, d_j) = \frac{d_t^T d_j}{\|d_t\| \, \|d_j\|}$$
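
A minimal sketch of the cosine measure on two short reports is 
given below. It assumes raw term frequencies as the weights 
$w_i(d)$ and whitespace tokenization; both are illustrative 
choices, since the weighting scheme is not fixed by the 
definition above, and the function name is hypothetical. 

```python
import math
from collections import Counter


def cosine_similarity(doc_a, doc_b):
    """Inner product of the length-normalized term-frequency vectors."""
    tf_a = Counter(doc_a.lower().split())
    tf_b = Counter(doc_b.lower().split())
    # Only words shared by both documents contribute to the inner product.
    dot = sum(tf_a[w] * tf_b[w] for w in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


same = cosine_similarity("engine fault", "engine fault")
partial = cosine_similarity("engine fault", "engine leak")
disjoint = cosine_similarity("engine fault", "hydraulic leak")
```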
The statistical method is to measure the similarity between 
different distributions. Here each distribution generates one 
document, while in the generative model we used in the 
previous section each distribution generates a cluster of 
documents. In our recurring anomaly detection problem, there 
are many single document clusters. In a statistical sense, a 
single document cluster is a single sample generated by the 
underlying distribution. The reason that we do not use the 
von Mises-Fisher (VMF) distribution, which we used in the 
previous section, is that we cannot estimate the mean and the 
variance unless we have a certain amount of data in each 
cluster. Estimating the parameters of the VMF distribution 
from a single sample returns the document vector itself as 
the mean, and zero variance. 

The statistical language model used in most previous work 
is the unigram model. This is the multinomial model, which 
assigns a probability to the occurrence of each word in the 
document: 

$$p(d) = \prod_{w_i} p(w_i \mid d)^{tf(w_i, d)}$$

where $p(w_i \mid d)$ is the probability that word $i$ occurs 
in document $d$, and $tf(w_i, d)$ indicates how many times 
word $i$ occurs in the document. 

Clearly, the problem is now essentially reduced to a 
multinomial distribution parameter estimation problem. The 
maximum likelihood estimate of the probability of a word 
occurring in the document is 

$$p(w_i \mid d) = \frac{tf(w_i, d)}{\sum_{j} tf(w_j, d)}$$
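
The maximum likelihood estimate is simply the relative 
frequency of each word. A minimal sketch, assuming the document 
has already been tokenized (the function names are 
illustrative): 

```python
import math
from collections import Counter


def unigram_mle(tokens):
    """Maximum likelihood unigram model: p(w | d) = tf(w, d) / sum_j tf(w_j, d)."""
    tf = Counter(tokens)
    total = sum(tf.values())
    return {word: count / total for word, count in tf.items()}


def log_likelihood(tokens, model):
    """log p(d) = sum_i tf(w_i, d) * log p(w_i | d) under a unigram model."""
    tf = Counter(tokens)
    return sum(count * math.log(model[word]) for word, count in tf.items())


tokens = "engine fault engine".split()
model = unigram_mle(tokens)
loglik = log_likelihood(tokens, model)
```

Words that do not occur in the document receive no entry (zero 
probability) under this estimate, which is what makes the KL 
divergence discussed below singular. 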


Furthermore, we use an algorithm based on a generative model 
of document creation. This new mixture word model measure is 
based on a novel view of how relevant documents are 
generated. We assume each recurring anomaly document is 
generated by a mixture of three language models: a general 
English language model, a user-specific topic model, and a 
document-specific information model. Each word is generated 
by each of the three language models with probability 
$\lambda_E$, $\lambda_T$ and $\lambda_{d_{core}}$ respectively: 

$$P(w_i \mid \theta_E, \theta_T, \theta_{d_{core}}, \lambda_E, \lambda_T, \lambda_{d_{core}}) = \lambda_E P(w_i \mid \theta_E) + \lambda_T P(w_i \mid \theta_T) + \lambda_{d_{core}} P(w_i \mid \theta_{d_{core}})$$

where $\lambda_E + \lambda_T + \lambda_{d_{core}} = 1$. 

For instance, in the short document "the airplane engine has 
some electric problems", the words "the", "has" and "some" 
probably come from the general English model, words such 
as "airplane" and "problems" are likely generated from the 
topic model, and the words "engine" and "electric" are 
generated from the document-specific information model. 
Because all the documents are anomaly reports on airplanes, 
they are all likely to contain words like "airplane" and 
"problem". The information contained in the document-specific 
model is what is useful for telling apart recurring anomalies 
caused by different problems, so measuring the similarity 
between only the document-specific models makes recurring 
anomaly detection more accurate. 

If we fix $\lambda_E$, $\lambda_T$ and $\lambda_{d_{core}}$, 
then there exists a unique optimal value for the document 
core model that maximizes the likelihood of the document. 

We employ a quick algorithm based on the Lagrange multiplier 
method to find the exact optimal solution, given fixed 
mixture weights [11]. 
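
One way to obtain such an exact solution is sketched below. It 
is not taken from [11]; it is a derivation under the assumption 
that the English and topic models are known and strictly 
positive on the vocabulary, so that the problem is to maximize 
$\sum_i tf(w_i, d) \log(\lambda_{d_{core}} p_i + b_i)$ over 
$p_i \ge 0$, $\sum_i p_i = 1$, with 
$b_i = \lambda_E P(w_i \mid \theta_E) + \lambda_T P(w_i \mid \theta_T)$. 
The Lagrangian conditions give $p_i = c \cdot tf_i - b_i / \lambda_{d_{core}}$ 
for words kept in the core model and $p_i = 0$ for the rest. The 
names `fit_core_model`, `background` and `lam_core` are 
hypothetical. 

```python
def fit_core_model(tf, background, lam_core):
    """Exact maximizer of sum_i tf[i] * log(lam_core * p[i] + background[i])
    subject to p[i] >= 0 and sum(p) == 1.

    tf:         term counts of the document, one per vocabulary word.
    background: lam_E * P(w | theta_E) + lam_T * P(w | theta_T), all > 0.
    lam_core:   fixed weight of the document core model, 0 < lam_core <= 1.
    """
    if sum(tf) <= 0:
        raise ValueError("document has no words")
    # Words are admitted in order of decreasing tf / background ratio.
    order = sorted(range(len(tf)), key=lambda i: tf[i] / background[i], reverse=True)
    sum_tf = sum_bg = c = 0.0
    active = 0
    for rank, i in enumerate(order, start=1):
        new_tf = sum_tf + tf[i]
        new_bg = sum_bg + background[i]
        # c is fixed by the constraint sum(p) == 1 over the admitted words.
        new_c = (1.0 + new_bg / lam_core) / new_tf
        if new_c * tf[i] - background[i] / lam_core <= 0.0:
            break  # this word, and every later one, gets probability 0
        sum_tf, sum_bg, c, active = new_tf, new_bg, new_c, rank
    p = [0.0] * len(tf)
    for i in order[:active]:
        p[i] = c * tf[i] - background[i] / lam_core
    return p


# Toy vocabulary of 4 words, uniform background with total weight 0.5.
core = fit_core_model(tf=[3, 1, 0, 0], background=[0.125] * 4, lam_core=0.5)
```

Words that are frequent in the document relative to the 
background keep probability mass in the core model, while words 
explained well enough by the background are driven to zero. 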

We need some metrics to measure the similarity between 
multinomial distributions. Kullback-Leibler divergence, a 
distributional similarity measure, is one way to measure the 
similarity of one multinomial distribution given another: 

$$KL(\theta_{d_t}, \theta_{d_j}) = \sum_{w_i} p(w_i \mid \theta_{d_t}) \log \frac{p(w_i \mid \theta_{d_t})}{p(w_i \mid \theta_{d_j})}$$

The problem with KL divergence is that if a word never 
occurs in a document, it gets zero probability, 
$p(w_i \mid d) = 0$. Thus a word that is in $d_t$ but not in 
$d_j$ will cause $KL(\theta_{d_t}, \theta_{d_j}) = \infty$. 


To avoid the singularity of KL divergence, we resort to other 
measures: Jensen-Shannon divergence, Jaccard's coefficient 
and skew divergence. Jensen-Shannon divergence [10] has been 
shown to be a useful symmetric measure of the distance 
between distributions: 

$$JS(\theta_{d_t}, \theta_{d_j}) = \frac{1}{2}\left[ KL(\theta_{d_t}, avg_{d_t,d_j}) + KL(\theta_{d_j}, avg_{d_t,d_j}) \right]$$

where $avg_{d_t,d_j} = (\theta_{d_t} + \theta_{d_j})/2$ is the 
average of the two distributions. 

We also employ skew divergence [10] to measure the similarity 
between two discrete distributions. Skew divergence is an 
asymmetric generalization of the KL divergence, 

$$Sk(\theta_{d_t}, \theta_{d_j}) = KL(\theta_{d_t}, (1 - \alpha)\theta_{d_t} + \alpha\theta_{d_j}) \quad \text{for } 0 \le \alpha \le 1$$

Note that at $\alpha = 1$, the skew divergence is exactly the 
KL divergence, and at $\alpha = 0.5$, the skew divergence is 
twice one of the summands of the Jensen-Shannon divergence. 
In our experiment, we choose $\alpha = 0.99$ to approximate 
the KL divergence and avoid singularity. 
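
The three divergences can be sketched in a few lines. The code 
below assumes each distribution is given as a list of 
probabilities over a shared vocabulary and uses the usual 
convention that terms with zero probability in the first 
argument contribute nothing; the function names are 
illustrative. 

```python
import math


def kl(p, q):
    """KL(p, q) = sum_i p_i * log(p_i / q_i); infinite if q_i = 0 < p_i."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # 0 * log(0 / q_i) is taken to be 0
        if q_i == 0.0:
            return math.inf
        total += p_i * math.log(p_i / q_i)
    return total


def js(p, q):
    """Jensen-Shannon divergence: mean KL to the average distribution."""
    avg = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * (kl(p, avg) + kl(q, avg))


def skew(p, q, alpha=0.99):
    """Skew divergence: KL from p to a mixture of p and q."""
    mix = [(1 - alpha) * a + alpha * b for a, b in zip(p, q)]
    return kl(p, mix)


# Two toy distributions with partly different supports.
p = [0.5, 0.5, 0.0]
q = [0.5, 0.0, 0.5]
```

On this pair `kl(p, q)` is infinite because the second word has 
zero probability under `q`, while `js(p, q)` and `skew(p, q)` 
stay finite. 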


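Jaccard's coefficient differs from all the other measures we 
consider in that it is essentially combinatorial, being based 
only on the sizes of the supports of the document-specific 
distributions rather than the actual values of the 
distributions: 

$$Jac(\theta_{d_t}, \theta_{d_j}) = \frac{|\{w : \theta_{d_t}(w) > 0 \text{ and } \theta_{d_j}(w) > 0\}|}{|\{w : \theta_{d_t}(w) > 0 \text{ or } \theta_{d_j}(w) > 0\}|}$$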
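
As a small illustration, Jaccard's coefficient only needs to 
know which words have non-zero probability in each 
document-specific distribution. The sketch assumes both 
distributions are lists over the same vocabulary; the function 
name is illustrative. 

```python
def jaccard(p, q):
    """Size of the shared support divided by the size of the joint support."""
    support_p = {i for i, value in enumerate(p) if value > 0}
    support_q = {i for i, value in enumerate(q) if value > 0}
    union = support_p | support_q
    if not union:
        return 0.0  # both distributions are empty
    return len(support_p & support_q) / len(union)


coefficient = jaccard([0.5, 0.5, 0.0], [0.5, 0.0, 0.5])
```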
Based on the similarity measurement between anomaly 
documents, we apply an agglomerative hierarchical clustering 
method to partition the documents. The agglomerative 
hierarchical algorithm produces a binary tree of clusters in 
a bottom-up fashion: the leaf nodes of the tree are single 
document clusters; a middle-level node is the centroid of the 
two most proximate lower level clusters; and the root node of 
the tree is the universal cluster, which contains all the 
documents. The agglomerative hierarchical clustering method 
we apply is single linkage clustering. The defining feature 
of the method is that similarity between groups is defined as 
the similarity between the closest pair of objects, where 
only pairs consisting of one object from each group are 
considered. We set a threshold on the similarity to obtain 
the partition that yields the optimal result. 
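
Cutting a single linkage tree at a fixed threshold is 
equivalent to linking every pair of documents whose distance is 
within the threshold and taking the connected components. The 
sketch below uses that equivalence with a small union-find 
structure instead of building the full tree. It works on a 
distance matrix (for example a divergence, or one minus a 
similarity), which is an assumption about how the threshold is 
applied; the function name is illustrative. 

```python
def single_linkage_clusters(distance, threshold):
    """Partition documents by single linkage cut at `threshold`.

    distance: symmetric n x n matrix (list of lists) of pairwise distances.
    Returns a sorted list of clusters, each a sorted list of document indices.
    """
    n = len(distance)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if distance[i][j] <= threshold:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i  # merge the two groups

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Toy distances: documents 0-1 and 1-2 are close, document 3 is far from all.
toy = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.2, 0.8],
    [0.9, 0.2, 0.0, 0.8],
    [0.8, 0.8, 0.8, 0.0],
]
partition = single_linkage_clusters(toy, threshold=0.3)
```

Documents 0 and 2 end up in the same cluster even though they 
are far apart, because each is close to document 1. This 
chaining is the characteristic behavior of single linkage. 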


New Performance Measures for Recurring Anomalies 

The recurring anomaly detection problem can be decomposed 
into two parts: detecting recurring anomalies and clustering 
recurring anomalies, so there is a need for different 
performance measures. We now present a simple example to 
indicate the need for the new performance measure. 

Suppose we only have 10 anomaly documents. In the column 
"Algorithm" in Table 1, we see that our algorithm groups the 
documents into 4 clusters. The column "Expert" shows the 
expert clustering results. 

Table 1. Simple clustering example for illustrating new 
performance measure 

             Algorithm      Expert 
Cluster 1    1, 2, 5, 6     1, 2, 3, 4 
Cluster 2    3, 4, 7        5, 8 
Cluster 3    9              9, 10 
Cluster 4    10 


In this example the algorithm has made the following 
mistakes: missing recurring anomaly 8; detecting non-recurring 
anomalies 6 and 7; separating recurring anomalies 1, 2, 3, 4 
into two clusters; separating recurring anomalies 9, 10 into 
two clusters; and combining recurring anomalies 1, 2, 5 into 
one cluster. So we summarize the mistakes into four 
categories: 1. missing a recurring anomaly; 2. detecting a 
non-recurring anomaly; 3. separating the same kind of 
recurring anomalies into different clusters; 4. combining 
different kinds of recurring anomalies into one cluster. The 
standard precision and recall measures can only characterize 
the first two mistakes, so we need to devise another metric 
to measure the last two mistakes. In our problem, 

$$\text{Precision} = \frac{R^+}{R^+ + N^+}$$

$$\text{Recall} = \frac{R^+}{R^+ + R^-}$$

$R^+$, $R^-$, $N^+$ and $N^-$ correspond to the number of 
documents that fall into the following categories: 


Table 2. 

                Labeled by Expert    Not Labeled by Expert 
Detected        R+                   N+ 
Not detected    R-                   N- 


The number of anomalies which are both detected by the 
algorithm and labeled by the expert is 7 (documents 1, 2, 3, 
4, 5, 9 and 10). The number of anomalies detected by the 
algorithm is 9, and the number of anomalies labeled by the 
expert is 8. So the precision is 7/9 ≈ 0.78 and the recall is 
7/8 ≈ 0.88. 
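
The counts can be checked directly from the document sets in 
Table 1. The sketch below follows the counting used above, in 
which every document that appears in the "Algorithm" column 
counts as detected and every document in the "Expert" column 
counts as labeled; the function name is illustrative. 

```python
def detection_precision_recall(detected, labeled):
    """Precision and recall of recurring-anomaly detection.

    detected: documents placed in a cluster by the algorithm.
    labeled:  documents marked as recurring anomalies by the expert.
    """
    detected, labeled = set(detected), set(labeled)
    r_plus = len(detected & labeled)   # detected and labeled
    n_plus = len(detected - labeled)   # detected but not labeled
    r_minus = len(labeled - detected)  # labeled but not detected
    precision = r_plus / (r_plus + n_plus) if detected else 0.0
    recall = r_plus / (r_plus + r_minus) if labeled else 0.0
    return precision, recall


# Clusters from Table 1.
algorithm = [[1, 2, 5, 6], [3, 4, 7], [9], [10]]
expert = [[1, 2, 3, 4], [5, 8], [9, 10]]
detected = [doc for cluster in algorithm for doc in cluster]
labeled = [doc for cluster in expert for doc in cluster]
precision, recall = detection_precision_recall(detected, labeled)
```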

Precision and recall measure the accuracy of detecting 
recurring anomalies, but do not characterize the accuracy of 
clustering anomalies. Because the anomalies that have not 
been both detected by the algorithm and labeled by the expert 
do not affect the accuracy of the clustering, we delete these 
anomalies. The remaining anomalies are shown in Table 3. 

Table 3. Simple clustering example for illustrating new 
performance measure (after deleting the documents which 
are not detected by both the algorithm and the experts) 

             Algorithm      Expert 
Cluster 1    1, 2, 5        1, 2, 3, 4 
Cluster 2    3, 4           5 
Cluster 3    9              9, 10 
Cluster 4    10 


To measure the mistakes caused by separating the same kind 
of recurring anomalies into different clusters, we add up the 
reciprocals of the number of split clusters and normalize by 
the total number of clusters in the expert result. If the 
algorithm result exactly matches the expert result, we get a 
score of 1. The score decreases as the number of split 
clusters increases. From the other point of view, a 
miscombination by the algorithm is a misseparation by the 
expert, so we use the same scheme, but based on the algorithm 
result, to calculate the miscombination score. The method to 
score the misseparation and miscombination is defined as 
follows: 

$$\text{Misseparation} = \frac{1}{NE} \sum_{i=1}^{NE} \frac{1}{NSA_i}, \qquad \text{Miscombination} = \frac{1}{NA} \sum_{j=1}^{NA} \frac{1}{NSE_j}$$

where 

NSA = number of algorithm clusters which contain the 
anomalies in each expert cluster 
NSE = number of expert clusters which contain the 
anomalies in each algorithm cluster 
NE = number of clusters in expert result 
NA = number of clusters in algorithm result 

The measure is best understood through the example. The 
algorithm separates anomalies 1, 2, 3 and 4 in expert cluster 
1 into 2 clusters, so the misseparation score for this 
cluster is 1/2; the misseparation score for expert cluster 2 
is 1; and the score for cluster 3 is 1/2. The overall score 
for separation is 1/2 + 1 + 1/2 = 2. To normalize the score, 
we divide it by the number of clusters in the expert result, 
which is 3. So the normalized misseparation score is 
2/3 ≈ 0.67. The miscombination score is calculated in the 
inverse direction. 

Experimental Results 

The aerospace system we analyzed is the NASA Space Shuttle, 
as represented by the problem reports in the CARS dataset, 
which consists of 7440 NASA Shuttle problem reports. These 
reports come from three subsystems. 

Domain experts read the anomaly reports and provided a 
clustering result. According to their results, among the 
total of 7440 reports there are 1553 recurring anomalies, 
which are grouped into 366 clusters. Consequently, there are 
7440 - 1553 = 5887 single document clusters, which makes this 
problem distinct. 



Figure 8. Comparing Precision and Recall Measure on 
CARS Data 



Figure 9. Comparing Misseparation and Miscombination 
Measure on CARS Data 
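
The misseparation and miscombination scores plotted here 
follow the scheme described in the previous subsection: for 
each cluster on one side, take the reciprocal of the number of 
clusters on the other side that its documents fall into, then 
average. A minimal sketch on the Table 3 example is given below. 
It assumes, as in Table 3, that documents not present in both 
results have already been removed; the function names are 
illustrative. 

```python
def split_score(reference, other):
    """Average over `reference` clusters of 1 / (number of clusters in
    `other` that contain documents of that reference cluster)."""
    cluster_of = {doc: idx for idx, cluster in enumerate(other) for doc in cluster}
    total = 0.0
    for cluster in reference:
        touched = {cluster_of[doc] for doc in cluster}
        total += 1.0 / len(touched)
    return total / len(reference)


def misseparation(algorithm, expert):
    """1.0 when no expert cluster is split across algorithm clusters."""
    return split_score(expert, algorithm)


def miscombination(algorithm, expert):
    """1.0 when no algorithm cluster mixes several expert clusters."""
    return split_score(algorithm, expert)


# Clusters from Table 3.
algorithm = [[1, 2, 5], [3, 4], [9], [10]]
expert = [[1, 2, 3, 4], [5], [9, 10]]
separation_score = misseparation(algorithm, expert)
combination_score = miscombination(algorithm, expert)
```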

Four similarity measures (cosine distance, skew divergence, 
Jensen-Shannon divergence and Jaccard's coefficient) are 
compared on the CARS data set. Figure 8 and Figure 9 summarize 
the effectiveness of the four similarity measure schemes. 




The skew divergence based on the word mixture model and the 
cosine distance are very effective. In general, they 
outperform all the other methods. The Jaccard's coefficient 
measure is the least accurate. It is surprising that the 
traditional cosine similarity metric is so effective, because 
cosine similarity is less well justified theoretically than 
the language modeling approach. However, cosine similarity 
has been demonstrated many times and over many tasks to be a 
robust similarity metric. Our results add recurring anomaly 
detection to the long list of tasks for which it is 
effective. In the region where recall ranges from 0.55 to 
0.85, the skew divergence is the most accurate. This region 
satisfies the user requirements: relatively high recall and 
low precision. 

To verify the effectiveness of the word mixture model, we 
compared the performance of the skew divergence measure based 
on the mixture model with that based on the general language 
model. The results are shown in Figure 10 and Figure 11. We 
see that the mixture model result is consistently more 
accurate than the general model. 



Figure 10. Comparing Precision and Recall Measure for 
Mixture Model 





Figure 11. Comparing Misseparation and Miscombination 
for Mixture Model 


11. Conclusions and future work 

Difficult to Classify Anomalies: 

We presented an experimental comparison of state-of-the-art 
techniques for text classification, applied to the problem 
of classifying flight reports into predefined categories of 
occurring anomalies. Starting from the "bag of words", 
applying feature reduction techniques and using an SVM 
classifier, we obtain very good results for some anomalies in 
terms of both precision and recall. However, for some other 
anomalies this model does not reach the desired level of 
accuracy. As mentioned above, the problem with the harder to 
classify anomalies can be related to the initial "bag of 
words", where the terms picked for those anomalies by the 
natural language processing methods are not descriptive 
enough. We plan to investigate the contents of the initial 
reports and find NLP methods particularly suited to doing 
better on the currently harder to classify anomalies. We can 
also address the problem by making suggestions, at the base 
level, on how the reports themselves should be written, 
particularly when describing events such as those anomalies 
which are difficult to classify at present with the currently 
given "bag of words". 

Future direction: Semantics or Statistics? 

Semantics or statistics? This is a question that has puzzled 
everyone working in the text mining field. For recurring 
anomaly detection on airplane problem reports, finding the 
semantic links between documents is much more important than 
devising a good statistical language model, because our data 
set has quite a few documents that are written in a form such 
as "this problem is similar to another problem". No 
statistical language model based on a bag-of-words matrix 
embodies such information. 

We call phrases such as "similar to" and "refer to" trigger 
words. If we could detect the documents which contain trigger 
words and also indicate a connection to other documents, we 
would obtain a tremendous improvement in the performance of 
our system. We checked the results and found that a large 
number of the recurring anomalies which were not detected by 
the algorithm are documents that contain trigger words. 
However, the algorithm also found quite a few recurring 
anomalies that the experts had not found, so we sent our 
results to the experts for reevaluation. 

To detect the documents which contain trigger words and also 
indicate a connection to other documents, we need to extract 
the information around the trigger words. Information 
extraction is a well-defined research area, and there are many 
techniques that we can apply to solve the trigger word 
problem. 
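
As a first, deliberately simple illustration of what extracting 
the information around a trigger word could look like, the 
sketch below pulls out the rest of the sentence that follows a 
trigger phrase. The phrase list, the pattern and the sample 
report text are hypothetical; this is not the extraction method 
of the system described here. 

```python
import re

# Hypothetical trigger phrases, followed by the rest of the sentence.
TRIGGER_PATTERN = re.compile(r"\b(similar to|refers? to)\b[^.]*", re.IGNORECASE)


def find_trigger_contexts(text):
    """Return each trigger phrase together with the text that follows it,
    up to the end of the sentence."""
    return [match.group(0).strip() for match in TRIGGER_PATTERN.finditer(text)]


# Made-up report text for illustration only.
sample = "Valve leak observed. This problem is similar to report 1234. No action."
contexts = find_trigger_contexts(sample)
```

The extracted context would still need to be resolved to an 
actual report before it could be used to link documents. 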